Article Info

Article history:
Received Nov 13, 2020
Revised Jul 19, 2021
Accepted Aug 5, 2021

Keywords:
Genetic algorithm
Haralick
Histogram
k-nearest neighbour
Local binary pattern

ABSTRACT

Coronavirus disease 2019 (COVID-19) has spread throughout the world. The detection of this disease is
usually carried out using the reverse transcriptase polymerase chain reaction (RT-PCR) swab test. However,
limited resources have become an obstacle to mass testing. To solve this problem, computerized tomography
(CT) scan images are used as one of the solutions to detect infected patients. This technique has been used
by researchers, but mostly with classifiers that require high resources, such as the convolutional neural
network (CNN). In this study, we proposed a way to classify CT scan images using a more efficient
classifier, k-nearest neighbors (KNN), on images processed with a combination of three feature extraction
methods: Haralick, histogram, and local binary pattern (LBP). A genetic algorithm is also used for feature
selection. The results showed that the proposed method was able to improve KNN performance, with the best
accuracy of 93.30% for the combination of Haralick and local binary pattern feature extraction, and the best
area under the curve (AUC) of 0.948 for the combination of Haralick, histogram, and local binary pattern.
The best accuracy of our models also outperforms CNN by a margin of 4.3 percentage points.

1. INTRODUCTION

Indonesia has recently been hit by the coronavirus disease 2019 (COVID-19) pandemic caused by the
severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Since the government first announced it in
March 2020, the virus has continued to spread to various provinces in Indonesia and has infected hundreds
of thousands of people. South Kalimantan, a province in Indonesia, is one of the areas with the highest
infection rates in the country.

One of the factors behind the high number of patients was the delay in processing reverse
transcriptase polymerase chain reaction (RT-PCR) swab tests, caused by the large number of specimens that
laboratories had to examine. As a result, test results became known only 14 days after the test was carried
out. PCR testing takes samples from places where the virus is most likely to be present, such as the back
of the nose or mouth or deep in the lungs [1]. The PCR test was also declared by the World Health
Organization (WHO) as the gold standard for detecting the presence of COVID-19 in humans.

Although known for its effectiveness, PCR testing is not the only option. The computerized
tomography (CT) scan has been reported to be more accurate than the PCR swab test in the early detection of
COVID-19 [2]. Many researchers have identified COVID-19 patients through CT scan images, for example
[3]-[5]. They used the convolutional neural network (CNN) method to classify positive and negative
COVID-19 patients with an accuracy of more than 90%.

CNN is a type of neural network for processing data that has a grid-like topology [6]. CNN is
widely used in computer vision, for example in [7]-[10]. Despite its various advantages, CNN requires
enormous computational resources [11]. However, machine learning offers many other classification
algorithms that can be used with low resources, one of which is k-nearest neighbors (KNN).

The KNN algorithm was formulated as a non-parametric method for pattern classification [12]. KNN
has also been described as a simple but effective algorithm in several cases [13]. The success of the KNN
algorithm depends on selecting a suitable k value. In this study, we used KNN to identify COVID-19 patients
based on CT scan images. The images were collected from Tongji Hospital in Wuhan, China [4].

Before the data mining process is carried out, texture-based features are extracted from the CT scan
images to obtain their characteristic values. Feature extraction in images is divided into several
categories, namely color, texture, and shape [14]. Texture-based feature extraction has the advantages of
low computational complexity and ease of implementation [15]. The feature extraction methods used in this
study are Haralick, local binary pattern (LBP), and 32-bin histogram.

One of the challenges in this study is that each feature extraction method produces a large number
of features. This can cause the curse of dimensionality (CoD), which leads to high time complexity [16].
CoD may also decrease the accuracy of the algorithm. To overcome this weakness, Sayed et al. [17] suggested
the use of feature selection, a method for selecting the most relevant features from a dataset. Reducing the
data dimension also improves performance in many cases.

One type of feature selection is the wrapper [18]. A wrapper uses a machine learning algorithm to
evaluate candidate feature combinations against a chosen evaluation criterion, compares the results, and
selects the combination that performs best. One algorithm that can be used to perform wrapper-based feature
selection is the genetic algorithm (GA), as has been done by [19]-[21].

In this study, we proposed a method to improve KNN classification in the computer vision field. The
genetic algorithm was used for feature selection in classifying COVID-19 patients through CT scan images
whose features were extracted using the Haralick method, 32-bin histogram, and local binary pattern. After
obtaining the classification results of the proposed method, we compared them with the best results of the
previous work [4], which used the CNN DenseNet-169 and ResNet-50 architectures. Before the CNN process,
that work applied a pre-processing step that resized the images to 480x480, and an image segmentation
process was also carried out to improve accuracy. In our method, by contrast, the images go directly into
feature extraction, without resizing or segmentation.


2. RESEARCH METHOD 

Our research was carried out as shown in Figure 1. Feature extraction was performed using the Python
library by Coelho [22], while feature selection and classification were performed using the RapidMiner [23]
software.


2.1. Dataset 

The dataset used in this study is the CT scan dataset compiled by [4]. It contains 746 grayscale
images, consisting of 349 CT scans of COVID-19 patients and 397 CT scans of non-COVID-19 patients. The
images are in either .jpg or .png format.


2.2. Feature extraction 

Before entering the classification stage, features are extracted from the downloaded dataset using
the Haralick method, 32-bin histogram, and local binary pattern. At this stage each image is converted into
a set of numerical feature values. The first feature extraction method is Haralick. These features contain
information about how pixel intensities at certain positions relative to each other occur together in the
image [24]. The method calculates its feature values at 8 angles, namely 0, 45, 90, 135, 180, 225, 270, and
315 degrees (360 degrees coincides with 0). Each angle produces 14 features using the formulas in Table 1.
Thus, each image produces 8x14 features. A simplification is then carried out by averaging each feature over
the angles, so that the number of features for each image becomes 14. The formulas in Table 1 are built from
the gray-tone spatial-dependence matrix of an image, with notation defined by [24] as follows:

- p(i, j): the (i, j)th entry in the normalized gray-tone spatial-dependence matrix;

- px(i): the ith entry in the marginal-probability matrix obtained by summing the rows of p(i, j);

- Ng: the number of distinct gray levels in the quantized image.
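
The notation above can be made concrete with a small sketch. The code below is not the library implementation used in this study. It is a minimal pure-Python illustration that builds a symmetric, normalized p(i, j) for one pixel offset, derives px(i), and computes a single Haralick feature (the angular second moment) before averaging it over directions. The toy image and the function names are assumptions chosen for illustration, and only four offsets are used because a symmetric matrix makes opposite directions identical.

```python
def glcm(image, dx, dy, levels):
    """Symmetric, normalized gray-tone spatial-dependence matrix p(i, j)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[a][b] += 1  # the pair (a, b)
                counts[b][a] += 1  # and its mirror, making the matrix symmetric
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def marginal_px(p):
    """px(i): the sum of row i of p(i, j)."""
    return [sum(row) for row in p]


def angular_second_moment(p):
    """One Haralick feature: the sum over i, j of p(i, j) squared."""
    return sum(v * v for row in p for v in row)


# Hypothetical 4x4 toy image with Ng = 4 gray levels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
# Pixel offsets (dx, dy) for 0, 45, 90 and 135 degrees at distance 1.
offsets = [(1, 0), (1, -1), (0, -1), (-1, -1)]
asm_per_angle = [angular_second_moment(glcm(image, dx, dy, 4)) for dx, dy in offsets]
asm_mean = sum(asm_per_angle) / len(asm_per_angle)  # averaged over directions
```

Averaging each of the 14 features over the directions in the same way is what reduces the per-image feature count to 14.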

The next extraction method is the histogram. An image histogram is a graphical representation of the
number of pixels at each intensity value. The values in each image are grouped into 32 value ranges, so this
step generates 32 features. The third method is local binary pattern. It is a simple but very efficient
texture operator that labels image pixels by thresholding the neighborhood of each pixel and treating the
result as a binary number [25]. The histogram of these labels can then be used as a texture descriptor.
This method produced 25 features.
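
The two operators just described can be sketched in a few lines. The code below is illustrative only and assumes 8-bit grayscale pixels stored as nested lists. The LBP shown is the basic 3x3 operator with 256 possible labels, which is not necessarily the variant that yields the 25 features reported in this study. The function names and the toy image are hypothetical.

```python
def histogram_32(image):
    """Count 8-bit pixel intensities in 32 equal-width bins (8 levels per bin)."""
    bins = [0] * 32
    for row in image:
        for value in row:
            bins[value // 8] += 1
    return bins


def lbp_code(image, r, c):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel."""
    centre = image[r][c]
    # Neighbours visited clockwise, starting at the top-left corner.
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(neighbours):
        if image[r + dr][c + dc] >= centre:
            code |= 1 << bit  # neighbour is at least as bright: set this bit
    return code


def lbp_histogram(image):
    """Histogram of LBP labels over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# Hypothetical 3x3 toy image.
image = [
    [10, 200, 30],
    [40, 100, 250],
    [90, 100, 5],
]
hist32 = histogram_32(image)
centre_code = lbp_code(image, 1, 1)
lbp_hist = lbp_histogram(image)
```

In both cases the resulting histogram, rather than the image itself, is what serves as the feature vector.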


[Figure 1 flowchart: collecting the dataset; extracting image features using Haralick, 32-bin histogram,
and local binary pattern; combining the extracted features to generate new datasets; classifying using KNN,
with and without feature selection using genetic algorithm; evaluating the models and selecting the best
accuracy; testing the models' significance]

Figure 1. Proposed method abstraction design
2.3. Generate new dataset

At this stage, new datasets are formed by combining the features extracted in Section 2.2. Seven new
datasets are generated, as described in Table 2. Each dataset has a different dimension depending on its
feature extraction method.


Table 2. Detail of new datasets

Dataset          Number of features   Description
Har              14                   Formed from Haralick extraction
Hist32           32                   Formed from 32-bin histogram extraction
LBP              25                   Formed from local binary pattern extraction
Har+Hist32       46                   Combination of Haralick and 32-bin histogram
Har+LBP          39                   Combination of Haralick and local binary pattern
Hist32+LBP       57                   Combination of 32-bin histogram and local binary pattern
Har+Hist32+LBP   71                   Combination of Haralick, 32-bin histogram, and local binary pattern
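
The dataset dimensions in Table 2 follow from concatenating the per-method feature vectors. As a quick check, the sketch below enumerates every non-empty combination of the three methods and sums their feature counts. The dictionary and the helper function `combine` are illustrative names, not part of the original workflow.

```python
from itertools import combinations

# Feature counts per extraction method, as reported in Section 2.2.
feature_counts = {"Har": 14, "Hist32": 32, "LBP": 25}

# Every non-empty combination of methods and its total dimension.
datasets = {}
for size in range(1, len(feature_counts) + 1):
    for names in combinations(feature_counts, size):
        datasets["+".join(names)] = sum(feature_counts[n] for n in names)


def combine(*feature_vectors):
    """Concatenate the per-image feature vectors into one row."""
    return [value for vector in feature_vectors for value in vector]
```

For instance, `combine(haralick_vector, lbp_vector)` on a 14-element and a 25-element vector gives one 39-element row of the Har+LBP dataset.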


2.4. Classification and cross validation 

At this stage, each dataset listed in Table 2 is classified using the KNN algorithm and validated
using 10-fold cross validation. The KNN classification is carried out for values of k from 2 to 17. The
value k=1 was excluded because of its high variance [26].
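
The procedure can be illustrated with a minimal sketch. The study itself used RapidMiner, so the code below is only an assumed pure-Python equivalent: it uses Euclidean distance, majority voting, and a simple interleaved fold split, none of which are confirmed settings from the original experiment. The toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(x, y, k, folds=10):
    """Mean accuracy over interleaved folds (row i goes to fold i % folds)."""
    scores = []
    for f in range(folds):
        test_idx = [i for i in range(len(x)) if i % folds == f]
        train_idx = [i for i in range(len(x)) if i % folds != f]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / folds


# Hypothetical, well-separated toy data: 20 rows per class, 2 features each.
x = [[i * 0.1, 0.0] for i in range(20)] + [[10 + i * 0.1, 1.0] for i in range(20)]
y = [0] * 20 + [1] * 20

# Evaluate k = 2 .. 17 and keep the k with the best cross-validated accuracy.
results = {k: cross_val_accuracy(x, y, k) for k in range(2, 18)}
best_k = max(results, key=results.get)
```

With an even k a vote can tie. In this sketch the tie falls to the label of the closest neighbour, which is one possible convention among several.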


2.4.1. Classification without feature selection (KNN Only) 
This classification uses all the features of each dataset in Table 2, with no feature selection
applied. Its accuracy is later compared with the accuracy of GA+KNN.


2.4.2. Classification using genetic algorithm feature selection (GA+KNN) 

For each dataset in Table 2, a new data subset is created that contains only the features selected by
the genetic algorithm. This algorithm works as follows [27]: i) Step 1: Initialize a random population of
individuals; ii) Step 2: Assign a fitness value to each individual in the population; iii) Step 3: Select
individuals from the population to create a new generation; iv) Step 4: Perform crossover on the selected
individuals; v) Step 5: Perform mutation so that the offspring produced by crossover do not simply resemble
the parent population; vi) Step 6: Repeat steps 2-5 until the stop criteria are met.
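
The six steps above can be sketched as a small wrapper-style search over feature bit masks. This is not the RapidMiner operator used in the study. Tournament selection, one-point crossover, bit-flip mutation, all parameter values, and the toy fitness function are assumptions made for illustration. In the actual pipeline the fitness would be the cross-validated KNN performance on the selected features.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Return the best feature mask found (1 = keep the feature)."""
    rng = random.Random(seed)
    # Step 1: random initial population of bit masks.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # Step 6: repeat steps 2-5
        # Step 2: fitness value of each individual.
        scores = [fitness(ind) for ind in population]

        # Step 3: tournament selection of one parent.
        def pick():
            a, b = rng.sample(range(pop_size), 2)
            return population[a] if scores[a] >= scores[b] else population[b]

        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            # Step 4: one-point crossover.
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Step 5: bit-flip mutation keeps the offspring diverse.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)  # never lose the best mask
    return best


def toy_fitness(mask):
    """Hypothetical fitness: features 0-3 help, every other feature costs 0.1."""
    return sum(mask[:4]) - 0.1 * sum(mask[4:])


best_mask = genetic_feature_selection(10, toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
```

The stop criterion here is a fixed number of generations. Other criteria, such as no improvement over several generations, would fit the same loop.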


2.5. Evaluation 

At this stage, the performance of the KNN algorithm is evaluated based on its accuracy and area
under the curve (AUC). The higher the accuracy value, the better the performance of the model. The same
holds for the AUC value.
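
Both measures can be computed directly from labels and classifier scores. The sketch below is a minimal illustration on hypothetical data. The AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. For KNN, the score could be taken as the fraction of positive neighbours, which is an assumption rather than a detail stated in this study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = COVID-19) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # threshold the scores at 0.5

acc_value = accuracy(y_true, y_pred)
auc_value = auc(y_true, scores)
```

Accuracy depends on the chosen threshold, while AUC summarizes the ranking quality across all thresholds, which is why the two measures can favour different models.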


2.6. Significance test 
At this stage, we use the t-test method. This method was applied to test the significance of the
difference between the best values produced by KNN and GA+KNN for each dataset in Table 2. With a
significance level of α = 0.05, a p-value below 0.05 (p-value < α) indicates that the two models are
significantly different.
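
The decision rule can be illustrated with a short sketch. The exact form of the t-test used in the study is not stated, so the code below assumes a paired t-test on per-fold accuracies from 10-fold cross validation. The accuracy values are hypothetical. Instead of computing a p-value, it compares the t statistic with the two-sided critical value for α = 0.05 and 9 degrees of freedom (2.262), which is an equivalent decision.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-fold accuracies from 10-fold cross validation.
knn_acc = [0.78, 0.80, 0.79, 0.81, 0.77, 0.80, 0.79, 0.78, 0.80, 0.79]
ga_knn_acc = [0.83, 0.84, 0.82, 0.85, 0.83, 0.84, 0.83, 0.82, 0.85, 0.84]

t_stat = paired_t_statistic(ga_knn_acc, knn_acc)

# Two-sided critical value for alpha = 0.05 with 10 - 1 = 9 degrees of freedom.
T_CRITICAL = 2.262
significant = abs(t_stat) > T_CRITICAL  # same decision as p-value < alpha
```

A statistic inside the interval from -2.262 to 2.262 would correspond to the "Not Significant" outcome.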


3. RESULT AND ANALYSIS 

At this stage, the best accuracy of the KNN model is compared with the best accuracy of the GA+KNN
model. Then, to check whether the difference between the best accuracies of the two models is statistically
significant, a t-test is performed. The test results can be seen in Table 3. Although not every dataset
shows a significant difference, the proposed method (GA+KNN) achieves a higher best accuracy than KNN on
every dataset.

The genetic algorithm significantly improved KNN classification accuracy on the Har, LBP, and
Har+LBP datasets. The best overall accuracy was achieved by GA+KNN on the Har+LBP dataset, and the best AUC
value was produced by GA+KNN on the Har+Hist32+LBP dataset. To determine the effectiveness of the model, we
compare the best-accuracy GA+KNN model with the CNN model produced by [4], as shown in Table 4. The model of
Yang et al. excels in AUC, while our proposed model is superior in terms of accuracy.


Table 3. Results of the best accuracy and its t-test

No  Dataset          KNN k  KNN accuracy  KNN AUC  GA+KNN k  GA+KNN accuracy  GA+KNN AUC  t-test (α = 0.05)
1   Har              5      79.10%        0.835    13        83.70%           0.896       Significant
2   Hist32           2      90.90%        0.936    2         91.80%           0.935       Not Significant
3   LBP              4      81.00%        0.859    -         89.00%           0.901       Significant
4   Har+Hist32       2      90.62%        0.933    -         92.23%           0.937       Not Significant
5   Har+LBP          -      84.60%        0.868    -         93.30%           0.937       Significant
6   Hist32+LBP       -      91.55%        0.942    -         92.63%           0.94        Not Significant
7   Har+Hist32+LBP   -      91.16%        0.926    -         92.76%           0.948       Not Significant


Table 4. GA+KNN comparison against the CNN model of Yang et al.

Model                     Accuracy   AUC
Yang et al. [4]           89%        0.98
Proposed model (GA+KNN)   93.30%     0.937


4. CONCLUSION 

The research showed that the model built using the genetic algorithm (GA+KNN) combined with
Haralick and local binary pattern features was able to improve the performance of the KNN-only
classification algorithm, producing the best accuracy of 93.30% with an AUC of 0.937. The resulting machine
learning model also outperforms the CNN model of Yang et al. [4] in accuracy, although that model was built
on the same dataset.